Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
How are you feeling right now?
Apprehensive about the uncertainties of the future.
What do you expect to learn?
Basic R tasks.
Where did you hear about the course?
Sisu.
It seems the book has a lot of content, probably quite useful.
I’m familiar with Markdown but not with R.
Repo link: https://github.com/gartenfeld/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sat Dec 9 17:28:25 2023"
The text continues here.
This week we follow through the examples in Exercise Set 2 and learn about how to do linear regression and beyond with R.
A few words about the data set:
Assuming that you, the reader, have no previous knowledge of the data – don’t worry, neither had I just a few minutes ago – from the appearance of it, there was a survey about study methods and an “attitude” questionnaire filled in by just under 200 Finns, seemingly pre-retirees judging by the first rows. There were also exam points, so they were perhaps also given some kind of test. More details here.
Given the schema of the data, and reverse-guessing the possible hypotheses, it’s possible that the claim was that some attitude or strategy predicts exam scores. That’s kind of reasonable: maybe people who believe statistics can “hyötyä” (be of use) would score higher than people who don’t believe that.
date()
## [1] "Sat Dec 9 17:28:25 2023"
library(readr)
data <- read_csv('data/learning_2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dim(data)
## [1] 166 7
The data contains 166 survey responses. Each response has 7 columns (variables): gender, the age of the participant, an attitude score (the average of 10 questions about attitudes towards learning statistics), deep/surface/strategic metrics (each the average of a category of questions about learning approaches – see more details in this document), and the points from some exam or test.
So the general gist is – you may have guessed it – what influences study outcomes in statistics? Is it attitude towards learning? Is it approach and life hacks in learning? Let’s find out.
“Show a graphical overview”, a “plot of everything”:
library(GGally)
library(ggplot2)
p <- ggpairs(data, lower = list(combo = wrap("facethist", bins = 20)))
p
The histograms (on the left-hand side) are quite useful for showing the distributions: we can see at a glance, for example, that there are actually a lot of younger students among the respondents, even though the first few rows have older people.
The box plots show how gender correlates with the other variables.
The scatter plots are harder for humans to eyeball for trends.
What’s most interesting for us, though, is the strip of panels where points is paired with the other variables.
Let’s pick the three variables with the largest correlations with points: attitude, stra, and surf.
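To back up the “largest” claim, one quick check (a sketch, assuming the `data` frame loaded above) is to rank the numeric variables by the absolute value of their correlation with `points`:

```r
# Rank numeric variables by |correlation| with points; attitude, stra
# and surf should top the list (sketch - assumes `data` as read above)
num_vars <- data[sapply(data, is.numeric)]
cors <- cor(num_vars)["points", ]
sort(abs(cors[names(cors) != "points"]), decreasing = TRUE)
```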
p1 <- ggplot(data, aes(x = attitude, y = points))
p2 <- p1 + geom_point()
p2
Seems a bit all over the place. Let’s add a regression line:
p3 <- p2 + geom_smooth(method = "lm")
# Draw plot
p3
## `geom_smooth()` using formula = 'y ~ x'
Now, onto fitting a regression model with all three variables!
multi_model <- lm(points ~ attitude + stra + surf, data = data)
multi_model
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Coefficients:
## (Intercept) attitude stra surf
## 11.0171 3.3952 0.8531 -0.5861
summary(multi_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Looking at the fitted coefficients (the estimates, not to be confused with R-squared): attitude has a large positive influence and is the only statistically significant predictor, stra has a much smaller (non-significant) influence, and surf a smaller still, negative (non-significant) influence.
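Since `stra` and `surf` are not statistically significant here, a natural follow-up (a sketch, not part of the output above) is to refit with `attitude` alone and compare the fits:

```r
# Refit with only the significant predictor and compare adjusted R-squared
simple_model <- lm(points ~ attitude, data = data)
summary(multi_model)$adj.r.squared
summary(simple_model)$adj.r.squared
```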
# Residuals vs Fitted values (1)
p_a = plot(multi_model, which = 1)
# Normal QQ-plot (2)
p_b = plot(multi_model, which = 2)
# Residuals vs Leverage (5)
p_c = plot(multi_model, which = 5)
According to the internet, the assumptions of a linear model include: a linear relationship between the predictors and the outcome, independence of the errors, constant error variance (homoscedasticity), and normally distributed errors.
These basically say that a linear model only makes sense if there really is a linear relationship between the outcome dots and the input. Yeah, it sounds tautological, but a (regression) line is only meaningful when a plausible underlying linear relationship exists – and when it does exist, the residuals tend to be evenly spread.
The quantile-quantile plot tracks the reference line quite closely. The residuals seem fairly evenly spread across the fitted values, maybe slightly top-heavy, and a handful of points have high leverage/influence on the fit.
This week we follow through the examples in Exercise Set 3.
date()
## [1] "Sat Dec 9 17:28:28 2023"
Create this file and include it in index.Rmd.
Read the joined student alcohol consumption data into R
library(readr)
data <- read_csv('data/drunk_kids.csv')
Print out the names of the variables in the data
names(data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Describe the data set briefly, assuming the reader has no previous knowledge of it
The data is about high-schoolers in Portugal and their grades. The two original datasets are from two subjects: maths and Portuguese. G1 and G2 are the grades from the first two periods, and G3 is the final grade. The questionnaire includes a bunch of socio-economic, behavioural, and other factors. In this exercise we wonder whether kids who drink heavily have worse study outcomes.
The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.
Choices of variables and reasoning:

- sex: interesting to see if there’s a gender difference in consumption
- romantic: cool kids drink more often and socialise more
- famrel: kids with a bad family relationship drink more
- goout: kids go out with friends and drink

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
library(dplyr)
data %>% count(sex)
## # A tibble: 2 × 2
## sex n
## <chr> <int>
## 1 F 195
## 2 M 175
library(ggplot2)
# Note: tidyr::gather() is superseded by tidyr::pivot_longer(), but still works
data_for_plot <- tidyr::gather(data, key = "timing", value = "combined_use", Dalc, Walc)
ggplot(data_for_plot, aes(x = sex, y = combined_use, fill = timing)) +
geom_boxplot(position = "dodge", alpha = 0.2) +
labs(
x = "Sex",
y = "Alcohol Consumption (5 = Very high)",
fill = "Occasion"
) +
scale_fill_manual(
values = c("Dalc" = "lightblue", "Walc" = "lightgreen"),
labels = c("Dalc" = "Weekday", "Walc" = "Weekend")
)
Comments:
library(ggplot2)
ggplot(data, aes(x = romantic, y = alc_use, fill = romantic)) +
geom_boxplot(position = position_dodge(0.8), alpha = 0.8) +
labs(title = "Alcohol Consumption by Sex and Romantic Relationship",
x = "Romantic Relationship",
y = "Alcohol Consumption",
fill = "Romantic") +
scale_fill_manual(values = c("yes" = "pink", "no" = "lightgrey"),
name = "In Relationship",
labels = c("yes" = "Yes", "no" = "No")) +
facet_grid(. ~ sex)
Comments:
library(ggplot2)
ggplot(data, aes(x = famrel, y = alc_use, color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Alcohol Consumption and Family Relationship",
x = "Family Relationship",
y = "Alcohol Consumption",
color = "Sex") +
scale_color_manual(values = c("M" = "deepskyblue", "F" = "pink"),
name = "Sex")
Comments:
library(ggplot2)
ggplot(data, aes(x = goout, y = alc_use, color = sex)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Alcohol Consumption and Going Out",
x = "Going Out with Friends",
y = "Alcohol Consumption",
color = "Sex") +
scale_color_manual(values = c("M" = "deepskyblue", "F" = "pink"),
name = "Sex")
Comments:
Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.
alco_predictor <- glm(high_use ~ famrel + goout, data = data, family = "binomial")
Present and interpret a summary of the fitted model.
summary(alco_predictor)
##
## Call:
## glm(formula = high_use ~ famrel + goout, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9404 0.6179 -3.140 0.00169 **
## famrel -0.3891 0.1350 -2.882 0.00395 **
## goout 0.7923 0.1193 6.640 3.14e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 394.65 on 367 degrees of freedom
## AIC: 400.65
##
## Number of Fisher Scoring iterations: 4
Interpretation:
Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis.
coef(alco_predictor)
## (Intercept) famrel goout
## -1.9403921 -0.3890518 0.7923023
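These are log-odds, not yet odds ratios. One way to present them as odds ratios with confidence intervals (a sketch using Wald intervals via `confint.default()`; profile-likelihood intervals from `confint()` would also work) is:

```r
# Exponentiate the coefficients and their Wald CIs to get odds ratios
exp(cbind(OR = coef(alco_predictor), confint.default(alco_predictor)))
```

With the estimates above this gives an odds ratio of roughly 2.2 for goout (each extra point of going out roughly doubles the odds of high use) and roughly 0.68 for famrel.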
Interpretation:
An increase in goout is associated with an increase in the log-odds of high_use, while an increase in famrel is associated with a decrease in the log-odds of high_use.

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions.
alco_predictor <- glm(high_use ~ famrel + goout, data = data, family = "binomial")
probs <- predict(alco_predictor, type = "response")
library(dplyr)
forwards_data <- mutate(data, probability = probs)
forwards_data <- mutate(forwards_data, prediction = probability > 0.5)
table(high_use = forwards_data$high_use, prediction = forwards_data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 233 26
## TRUE 64 47
Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results.
TP <- 47
TN <- 233
FP <- 26
FN <- 64
training_error <- (FP + FN) / (TP + TN + FP + FN)
training_error
## [1] 0.2432432
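Instead of hard-coding the four counts, the same number can be read off the confusion table (a sketch, assuming `forwards_data` from the previous chunk):

```r
# Training error straight from the cross-tabulation: 1 - accuracy
conf <- table(high_use = forwards_data$high_use,
              prediction = forwards_data$prediction)
1 - sum(diag(conf)) / sum(conf)   # (26 + 64) / 370, same 0.243 as above
```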
Compare the performance of the model with performance achieved by some simple guessing strategy.
So it gets about a quarter of the cases wrong. Personally, I’m not sure what counts as a “simple guessing strategy”: the model is better than a coin toss (50% error), and also better than always guessing the majority class, low use, which would be wrong about 30% of the time (111 high users out of 370). But 75+% correct classification seems a rather good first-try result in machine-learning land.
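One concrete “simple guessing strategy” is the majority-class baseline: always predict low use. Its training error is just the share of high users (a sketch on the same data):

```r
# Always guessing FALSE (low use) is wrong exactly when high_use is TRUE,
# so the baseline's training error is the proportion of high users
mean(forwards_data$high_use)   # 111 / 370 = 0.3
```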
Perform 10-fold cross-validation on your model.
data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ famrel + goout, data = data, family = "binomial")
# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Perform cross-validation using caret
cv_results <- train(high_use ~ famrel + goout, data = data, method = "glm", family = "binomial", trControl = ctrl)
# View cross-validation results
cv_results
## Generalized Linear Model
##
## 370 samples
## 2 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 332, 333, 334, 333, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7458472 0.3127632
The cross-validated prediction is about 75% accurate and quite a bit better than chance.
The performance of this model is on par with the model introduced in the Exercise Set (which had about 0.26 error).
Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
Alright, let’s start with a “very high number of predictors”.
data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + freetime + health + failures + paid + absences + G1 + G2 + G3 + famrel + goout, data = data, family = "binomial")
# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Perform cross-validation using caret
cv_results <- train(high_use ~ address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + freetime + health + failures + paid + absences + G1 + G2 + G3 + famrel + goout, data = data, method = "glm", family = "binomial", trControl = ctrl)
# View cross-validation results
cv_results
## Generalized Linear Model
##
## 370 samples
## 28 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 333, 332, 334, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7380117 0.3259937
Not much better…
Now, with hand-picked variables:
data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ address + Pstatus + higher + absences + famrel + goout, data = data, family = "binomial")
# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Perform cross-validation using caret
cv_results <- train(high_use ~ address + Pstatus + higher + absences + famrel + goout, data = data, method = "glm", family = "binomial", trControl = ctrl)
# View cross-validation results
cv_results
## Generalized Linear Model
##
## 370 samples
## 6 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 334, 333, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7537221 0.3422244
After manually trying a bunch of combos, it doesn’t seem to get much better than this… Due to the random fold assignment in cross-validation, the accuracy also jumps up and down between runs, making it hard to eyeball whether one model is better or worse than another.
But wait, what about…
data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ Walc + Dalc, data = data, family = "binomial")
# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Perform cross-validation using caret
cv_results <- train(high_use ~ Walc + Dalc, data = data, method = "glm", family = "binomial", trControl = ctrl)
# View cross-validation results
cv_results
## Generalized Linear Model
##
## 370 samples
## 2 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 333, 334, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 1 1
Well, statistical truths are often tautological.
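To spell the tautology out: assuming the data-wrangling step defined `alc_use` as the mean of `Dalc` and `Walc`, and `high_use` as `alc_use > 2` (as in the exercise set), the “predictors” determine the target exactly:

```r
# high_use is a deterministic function of Dalc and Walc (assumed wrangling),
# so a classifier using them can be 100% accurate on this data
all(((data$Dalc + data$Walc) / 2 > 2) == (data$high_use == "TRUE"))
```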
This week we follow through the examples in Exercise Set 4.
date()
## [1] "Sat Dec 9 17:28:30 2023"
“Load the Boston data from the MASS package”
# Load MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
“Explore the structure and the dimensions of the data”
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
“Describe the dataset briefly, assuming the reader has no previous knowledge of it.”
The dataset comes from a study by Harrison and Rubinfeld (1978) on
house prices in Boston suburbs. It’s commonly used for teaching stats
(specifically regression analysis, often called “machine learning” these
days) by examples of predicting real estate prices from a number of
demographic and socio-economic factors. The dataset contains 506
observations (rows) and 13 predictor variables plus the dependent
variable (medv, median
value of owner-occupied homes). See documentation.
“Show a graphical overview of the data”
library(GGally)
library(ggplot2)
library(dplyr)
p <- ggpairs(Boston)
p
cor_matrix <- cor(Boston) %>% round(digits = 2)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle", diag = FALSE, type = 'lower')
“Show summaries of the variables in the data”
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
“Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.”
Looking at the bottom row medv as an example, we can see
that rm (“average number of rooms per dwelling”) is
strongly positively correlated with medv and that
lstat (“lower status of the population in %”) is strongly
negatively correlated. This is not surprising simply from reading the
variable descriptions, that is, they are re-stating intuitive claims
such as “larger houses cost more”, “poor people live in cheap
neighbourhoods”.
“Standardize the dataset and print out summaries of the scaled data.”
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
“How did the variables change?”
Each variable now has mean 0 and is expressed in standard deviations from its mean, so most values fall roughly between −3 and 3, with anything beyond ±3 being extreme.
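To see what `scale()` actually did: each column was centred on its mean and divided by its standard deviation. For example, for `rm` (a sketch):

```r
# scale() is column-wise (x - mean(x)) / sd(x); verify for `rm`
manual_rm <- (Boston$rm - mean(Boston$rm)) / sd(Boston$rm)
all.equal(as.numeric(boston_scaled[, "rm"]), manual_rm)
```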
“Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate).”
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)
bins <- quantile(boston_scaled$crim)
“Use the quantiles as the break points in the categorical variable.”
quantile_labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = quantile_labels, include.lowest = TRUE)
boston_scaled <- data.frame(boston_scaled, crime)
“Drop the old crime rate variable from the dataset.”
boston_scaled <- dplyr::select(boston_scaled, -crim)
“Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.”
num_observations <- nrow(boston_scaled)
sample_ids <- sample(num_observations, size = num_observations * 0.8)
train_set <- boston_scaled[sample_ids,]
test_set <- boston_scaled[-sample_ids,]
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
“Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.”
# `crime` as the target variable, and "." (all the others) as predictors
lda.fit <- lda(crime ~ ., data = train_set)
classes <- as.numeric(train_set$crime)
“Draw the LDA biplot.”
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
“Save the crime categories from the test set”
correct_crime_level_predictions <- test_set$crime
“Remove the categorical crime variable from the test dataset.”
test_set <- dplyr::select(test_set, -crime)
“Then predict the classes with the LDA model on the test data.”
lda.pred <- predict(lda.fit, newdata = test_set)
“Cross tabulate the results with the crime categories from the test set.”
table(correct = correct_crime_level_predictions, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 21 10 0 0
## med_low 4 13 4 0
## med_high 1 10 14 1
## high 0 0 0 24
“Comment on the results.”
Low crime areas are fairly accurately predicted: of the 31 actual low-crime areas, 21 are predicted correctly and the other 10 as medium-low. High crime predictions fare even better: all 24 are spot on. The two categories in the middle get mixed up more often in either direction, being correct roughly half to two-thirds of the time. (Exact counts vary with the random train/test split.)
“Reload the Boston dataset.”
library(MASS)
data("Boston")
“Scale the variables to get comparable distances.”
boston_scaled <- scale(Boston)
“Calculate the distances between the observations.”
dist_eu <- dist(boston_scaled)
“Run k-means algorithm on the dataset.”
km <- kmeans(boston_scaled, centers = 4)
“Investigate what is the optimal number of clusters and run the algorithm again.”
set.seed(42)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# Spot the cliff
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
“Visualize the clusters”
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
“Interpret the results”
The idea of clustering observations into groups is predicated on the notion that there are natural or hidden categories in the observations; namely, we could ask “how many types of suburbs are there?” The grouping is easily spotted visually when – for simplicity’s sake, let’s imagine two classes – the two classes have metrics that are well away from each other. We can then see on a scatter plot that one group is over here in one corner, and the other group is way over there in the opposite corner; in the terms used in the steps above, there is a great distance between those two clouds. Here we have more than 2 dimensions – a dozen or so variables – but that’s not a problem for the machine, and similar principles apply: we try to find groups whose centres are away from each other, all dimensions considered. The chart is a bit hard to read, but upon zooming in you can see that for some variable pairs the groups are quite distinct.
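The “centres away from each other” idea can be checked numerically: `dist()` on the fitted cluster centres gives their separation in the scaled space (a sketch, using the 2-cluster fit above):

```r
# How far apart are the two k-means centres, in standardized units?
dist(km$centers)
```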
“Perform k-means on the original Boston data with some reasonable number of clusters (> 2).”
km <- kmeans(boston_scaled, centers = 3)
“Then perform LDA using the clusters as target classes.”
cluster <- km$cluster
boston_scaled <- data.frame(boston_scaled, cluster)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv cluster
## Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:1.000
## Median :-0.1811 Median :-0.1449 Median :2.000
## Mean : 0.0000 Mean : 0.0000 Mean :1.941
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:3.000
## Max. : 3.5453 Max. : 2.9865 Max. :3.000
“Visualize the results with a bi-plot.”
lda.fit <- lda(cluster ~ ., data = boston_scaled[, -4])
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
“Interpret the results. Which variables are the most influential linear separators for the clusters?”
I would say indus (proportion of non-retail business acres per town) is useful for separating Group 3 out, and tax (full-value property-tax rate per $10,000) is useful for distinguishing Groups 1 and 2.
This week we follow through the examples in Exercise Set 5.
date()
## [1] "Sat Dec 9 17:28:37 2023"
library(readr)
human <- read_csv("data/human.csv", show_col_types = FALSE)
“Move the country names to rownames”
library(tibble)
human <- column_to_rownames(human, "Country")
“Show a graphical overview of the data”
library(GGally)
library(ggplot2)
ggpairs(human, progress = FALSE)
“Show summaries of the variables in the data”
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(corrplot)
cor(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI
## Edu2.FM 1.000000000 0.009564039 0.5760299 0.59325156 0.43030485
## Labo.FM 0.009564039 1.000000000 -0.1400125 0.04732183 -0.02173971
## Life.Exp 0.576029853 -0.140012504 1.0000000 0.78943917 0.62666411
## Edu.Exp 0.593251562 0.047321827 0.7894392 1.00000000 0.62433940
## GNI 0.430304846 -0.021739705 0.6266641 0.62433940 1.00000000
## Mat.Mor -0.660931770 0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415 0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F 0.078635285 0.250232608 0.1700863 0.20608156 0.08920818
## Mat.Mor Ado.Birth Parli.F
## Edu2.FM -0.6609318 -0.5294184 0.07863528
## Labo.FM 0.2404611 0.1201589 0.25023261
## Life.Exp -0.8571684 -0.7291774 0.17008631
## Edu.Exp -0.7357026 -0.7035649 0.20608156
## GNI -0.4951623 -0.5565621 0.08920818
## Mat.Mor 1.0000000 0.7586615 -0.08944000
## Ado.Birth 0.7586615 1.0000000 -0.07087810
## Parli.F -0.0894400 -0.0708781 1.00000000
library(corrplot)
correlation_matrix <- cor(human) %>% round(digits = 2)
corrplot(correlation_matrix, method="circle", diag = FALSE, type = 'lower')
ggplot(human, aes(x = GNI)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "black", alpha = 0.7)
“Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them”
What can we say about the distributions of the variables? They are somewhat reminiscent of Hans Rosling’s lectures… And because of the hand-picked nature of our basket of variables, it is not surprising that the scatter plots show some noticeable patterns. GNI, for example, is heavily right-skewed: a few very rich countries, a bunch of middle-income countries, then many poor countries.
Going from the Rosling perspective (public health and human development), we can see for example that Life Expectancy and Maternal Mortality are strongly (negatively) correlated, both being influenced by health infrastructure and the like. Life Expectancy also correlates with the availability of education, which goes hand in hand with economic development. We can also see that more women being educated correlates with lower Maternal Mortality as well as fewer teenage births.
Another interesting pairing is GNI / Parli.F, which shows that becoming rich has not made countries more equal: the scatter shows a fairly even spread of female representation across income levels (sideways), and in fact there are a few outliers of very rich countries with almost no female parliamentary representation.
We can see from the correlation matrix that the variables we have selected have a general tendency to overlap with each other strongly, either positively or negatively, not necessarily because they are synonymous or antonymous with each other, but because they reflect different dependent facets of some shared underlying influence or condition. The weak circles (low correlations) are also interesting: for example, improved health development doesn’t translate into women working more (Life.Exp vs. Labo.FM is only about -0.14).
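Out of curiosity (my own addition, not part of the exercise), the strongest pairs can also be pulled out of the correlation matrix programmatically rather than by squinting at circles; a small helper sketch:

```r
# Sketch of a helper (my own, hypothetical name): list the strongest
# pairwise correlations in a correlation matrix, largest |r| first.
strongest_pairs <- function(cm, n = 5) {
  cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
  idx <- order(abs(cm), decreasing = TRUE, na.last = NA)[seq_len(n)]
  data.frame(
    var1 = rownames(cm)[row(cm)[idx]],
    var2 = colnames(cm)[col(cm)[idx]],
    r    = cm[idx]
  )
}
```

Calling `strongest_pairs(cor(human))` then lists the top correlations (Mat.Mor vs. Life.Exp, Life.Exp vs. Edu.Exp, etc.) in a small data frame.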
“Perform principal component analysis (PCA) on the raw (non-standardized) human data”
pca_human <- prcomp(human)
“Show the variability captured by the principal components”
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
“Draw a biplot displaying the observations by the first two principal components along with arrows representing the original variables”
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
“Standardize the variables in the human data and repeat the above analysis”
human_standardised <- scale(human)
pca_human_standardised <- prcomp(human_standardised)
summary(pca_human_standardised)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.071 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.536 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.536 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
biplot(pca_human_standardised, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
“Interpret the results of both analysis (with and without standardizing)”
“Are the results different? Why or why not?”
The results are very different before and after scaling. The unscaled plots are virtually unreadable. This is more or less expected because of the vastly differing units: GNI is measured in dollars and runs into the tens of thousands (the maximum here is over 120,000), whereas the ratio variables stay below about 1.5. The unscaled biplot has GNI squashed into a single horizontal arrow, so PC1 essentially just reproduces GNI (it alone accounts for 99.99% of the variance).
Interpreting the scaled biplot is a lot more sensible. We can see that Life Expectancy, the years of education a child can expect, and the female-to-male ratio of secondary education attainment are all correlated with each other and with money (GNI), and they are negatively correlated with Maternal Mortality and teenage births. All of these variables have arrows that are close to parallel to PC1.
Women in parliament and in the labour force, in contrast, are correlated with each other and close to parallel with PC2.
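The dominance of GNI can be confirmed numerically; a quick check of the raw column variances (my own addition, assuming `human` is still loaded):

```r
# Variances of the unstandardised columns: GNI is many orders of magnitude
# larger than the ratio variables, so it alone drives the first PC.
sapply(human, var)
# After scaling, every column has variance exactly 1, so each variable
# gets an equal say in the PCA.
sapply(as.data.frame(scale(human)), var)
```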
See more comments below.
“Include captions (brief descriptions) in your plots”
biplot(pca_human_standardised, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
legend("bottomright", legend = c(
"Life.Exp: Life Expectancy ",
"Edu.Exp: Expected Years of Education ",
"GNI: Gross National Income",
"Mat.Mor: Maternal Mortality Ratio",
"Ado.Birth: Adolescent Birth Rate",
"Parli.F: Percentage of Women in Parliament",
"Edu2.FM: Female-to-Male Ratio of Secondary Education",
"Labo.FM: Female-to-Male Ratio of Labor Force Participation"
), cex = 0.2)
“Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data”
The principal components can sometimes be hidden factors that do not map directly to a known concept, or that map to a very abstract quality that is hard to pinpoint, so we cannot over-generalise that an axis is “richness” or “goodness” or “development”. One can only make rough guesses about what kinds of qualities the axes may correspond to, based on heuristics (namely, generalisation from preconceived perceptions).
In the biplot, the points (labelled with country names) are observations mapped along the two main PCs. Points close to each other share similar values as measured by the PC axes. Here, using our preconceived priors as heuristics, we can gauge: ah okay, Qatar is on one side and Afghanistan on the other, so this axis has something to do with rich vs. poor. Then you see the Nordics up there and Iran down there, and you wonder whether that axis has something to do with permissiveness towards certain things, which could be bacon for all we know. Here I must apologise for my geographical ignorance: I really don’t know what it is qualitatively and subjectively like in those poorer countries that happen to sit quite high up in terms of PC2. Kigali reportedly has a vibrant start-up scene with hipster coffee shops nowadays, which could relate to the UK government’s interest in sending people there.
If two arrows are radially close to each other (that is, have similar clock-angles on the chart), the two original variables co-vary more closely: if they point the same way they co-vary positively, and if they point in almost 180-degree opposite directions they co-vary negatively.
We can also think of the PC axes like the arrows: if an arrow is close to parallel with an axis, its corresponding original variable is more aligned with that PC. The length of the arrow indicates how strongly that original variable contributes to the components shown.
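This reading of the arrows can be cross-checked against the numbers: the loadings live in the `rotation` element of the `prcomp` result (a small peek, using the `pca_human_standardised` object from above):

```r
# The loadings matrix: each column is a PC, each row an original variable.
# Large absolute loadings on PC1 or PC2 correspond to long arrows that are
# nearly parallel to that axis in the biplot.
round(pca_human_standardised$rotation[, 1:2], 2)
```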
“Load the tea dataset and convert its character variables to factors”
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
“Look at the structure and the dimensions of the data”
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
“Use View(tea) to browse its contents, and visualize the data”
# View(tea)
# Run this for one's own viewing leisure
“Use Multiple Correspondence Analysis (MCA) on the tea data (or a subset of columns)”
library(FactoMineR)
subset_tea <- tea[, c("Tea", "How", "how", "sugar")]
mca <- MCA(subset_tea, graph = FALSE)
“Interpret the results of the MCA”
summary(mca)
##
## Call:
## MCA(X = subset_tea, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.338 0.292 0.272 0.261 0.242 0.227 0.208
## % of var. 16.924 14.598 13.609 13.031 12.088 11.337 10.389
## Cumulative % of var. 16.924 31.522 45.131 58.162 70.250 81.587 91.977
## Dim.8
## Variance 0.160
## % of var. 8.023
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.102 0.010 0.008 | 0.251 0.072 0.046 | -0.070
## 2 | 0.521 0.267 0.128 | 0.730 0.609 0.251 | -0.622
## 3 | -0.045 0.002 0.003 | -0.227 0.059 0.074 | -0.322
## 4 | -0.625 0.385 0.534 | -0.109 0.014 0.016 | -0.130
## 5 | -0.045 0.002 0.003 | -0.227 0.059 0.074 | -0.322
## 6 | -0.045 0.002 0.003 | -0.227 0.059 0.074 | -0.322
## 7 | -0.045 0.002 0.003 | -0.227 0.059 0.074 | -0.322
## 8 | 0.521 0.267 0.128 | 0.730 0.609 0.251 | -0.622
## 9 | 0.057 0.003 0.002 | 0.343 0.134 0.063 | -0.320
## 10 | 0.946 0.881 0.533 | 0.104 0.012 0.006 | 0.100
## ctr cos2
## 1 0.006 0.004 |
## 2 0.474 0.182 |
## 3 0.127 0.149 |
## 4 0.021 0.023 |
## 5 0.127 0.149 |
## 6 0.127 0.149 |
## 7 0.127 0.149 |
## 8 0.474 0.182 |
## 9 0.125 0.055 |
## 10 0.012 0.006 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.161 24.547 0.441 11.485 | 0.745 11.739 0.182
## Earl Grey | -0.532 13.455 0.511 -12.358 | -0.031 0.054 0.002
## green | 0.509 2.107 0.032 3.096 | -1.489 20.894 0.274
## alone | 0.124 0.736 0.028 2.918 | -0.476 12.611 0.421
## lemon | -0.691 3.880 0.059 -4.201 | 0.295 0.822 0.011
## milk | -0.252 0.987 0.017 -2.249 | 0.817 12.002 0.177
## other | 1.617 5.795 0.081 4.918 | 3.511 31.670 0.381
## tea bag | -0.349 5.092 0.159 -6.897 | 0.140 0.949 0.026
## tea bag+unpackaged | 0.264 1.609 0.032 3.080 | 0.078 0.165 0.003
## unpackaged | 0.958 8.142 0.125 6.120 | -0.865 7.695 0.102
## v.test Dim.3 ctr cos2 v.test
## black 7.376 | 0.182 0.748 0.011 1.798 |
## Earl Grey -0.724 | 0.057 0.190 0.006 1.316 |
## green -9.054 | -0.739 5.516 0.067 -4.492 |
## alone -11.217 | -0.096 0.545 0.017 -2.251 |
## lemon 1.796 | 2.116 45.221 0.553 12.861 |
## milk 7.283 | -0.846 13.801 0.190 -7.541 |
## other 10.677 | 0.234 0.150 0.002 0.711 |
## tea bag 2.766 | -0.439 10.042 0.252 -8.685 |
## tea bag+unpackaged 0.917 | 0.316 2.869 0.045 3.688 |
## unpackaged -5.526 | 1.250 17.214 0.213 7.980 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.543 0.382 0.070 |
## How | 0.154 0.667 0.650 |
## how | 0.201 0.103 0.328 |
## sugar | 0.456 0.016 0.040 |
“Draw the variable biplot of the analysis”
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
“Comment on the output of the plots”
I can comment that it is indeed news to me that Earl Grey can also go with lemon. It is quite expected that green tea does not go with milk the way Earl Grey or black tea does, but I am curious to find out what that “other” category might entail. Honey, perhaps? It is also curious what “tea bag+unpackaged” means; perhaps sometimes this, sometimes that. You can also see that for green tea drinkers it leans more towards snobbery, where drinking loose leaves is perceived as somehow superior to just dunking a pre-manufactured bag in hot water.
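For a numeric counterpart to squinting at the plot (my own addition), FactoMineR stores per-category contributions in the MCA result; a sketch assuming the `mca` object from above:

```r
# Category contributions to dimension 1, largest first. FactoMineR keeps
# these in mca$var$contrib as percentages that sum to 100 per dimension.
head(sort(mca$var$contrib[, 1], decreasing = TRUE))
```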
This week we follow through the examples in Exercise Set 6.
date()
## [1] "Sat Dec 9 17:28:40 2023"
Read in the transformed datasets:
library(readr)
BPRS <- read_csv("data/BPRS_long.csv", show_col_types = FALSE)
RATS <- read_csv("data/RATS_long.csv", show_col_types = FALSE)
“Implement the analyses of Chapter 8 of MABS … but using the RATS data”
“Look at the (column) names”
names(RATS)
## [1] "ID" "Group" "Weight" "Time"
“Look at the structure”
str(RATS)
## spc_tbl_ [176 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:176] 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : num [1:176] 1 1 1 1 1 1 1 1 2 2 ...
## $ Weight: num [1:176] 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : num [1:176] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Group = col_double(),
## .. Weight = col_double(),
## .. Time = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
“Print out summaries of the variables”
summary(RATS)
## ID Group Weight Time
## Min. : 1.00 Min. :1.00 Min. :225.0 Min. : 1.00
## 1st Qu.: 4.75 1st Qu.:1.00 1st Qu.:267.0 1st Qu.:15.00
## Median : 8.50 Median :1.50 Median :344.5 Median :36.00
## Mean : 8.50 Mean :1.75 Mean :384.5 Mean :33.55
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:511.2 3rd Qu.:50.00
## Max. :16.00 Max. :3.00 Max. :628.0 Max. :64.00
Convert categorical columns to factor
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
“Take a glimpse at the data”
library(dplyr)
glimpse(RATS)
## Rows: 176
## Columns: 4
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Weight <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
“Draw the plot”
library(ggplot2)
rat_count <- length(unique(RATS$ID))
colour_palette <- sample(rainbow(rat_count))
range <- c(min(RATS$Weight), max(RATS$Weight))
ggplot(RATS, aes(x = Time, y = Weight)) +
# geom_smooth(aes(color = ID), method = "loess", se = FALSE) +
geom_line(aes(color = ID)) +
scale_color_manual(values = colour_palette) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = range)
“Standardise the variable”
Standardised_RATS <- RATS %>%
group_by(Time) %>%
mutate(Standardised_Weight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
“Glimpse the data”
glimpse(Standardised_RATS)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1,…
## $ Weight <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, …
## $ Time <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8,…
## $ Standardised_Weight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0…
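A quick sanity check on the standardisation (my own addition): each measurement time should now have mean roughly 0 and standard deviation 1.

```r
# Verify the group-wise standardisation: within each Time point the
# standardised weights should have mean ~0 and sd 1.
Standardised_RATS %>%
  group_by(Time) %>%
  summarise(m = mean(Standardised_Weight), s = sd(Standardised_Weight))
```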
“Plot again with the standardised values”
library(ggplot2)
rat_count <- length(unique(Standardised_RATS$ID))
colour_palette <- sample(rainbow(rat_count))
range <- c(min(Standardised_RATS$Standardised_Weight), max(Standardised_RATS$Standardised_Weight))
ggplot(Standardised_RATS, aes(x = Time, y = Standardised_Weight)) +
geom_line(aes(color = ID)) +
scale_color_manual(values = colour_palette) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = range, name = "Standardised Weight")
“Summarise data with mean and standard error of Weight by Group and Time”
Summarised_RATS <- RATS %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight) / sqrt(length(Weight)) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
“Glimpse the data”
glimpse(Summarised_RATS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <dbl> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87…
“Plot the mean profiles”
ggplot(Summarised_RATS, aes(x = Time, y = mean)) +
geom_line(aes(color = Group)) +
geom_ribbon(aes(ymin = mean - se, ymax = mean + se, fill = Group), alpha = 0.2) +
scale_y_continuous(name = "Weight")
Check for outliers
“Create summary data by Group and ID with mean as the summary variable”
RATS_box <- RATS %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
“Glimpse the data”
glimpse(RATS_box)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
“Draw a boxplot of the mean”
ggplot(RATS_box, aes(x = Group, y = mean, fill = Group)) +
geom_boxplot(alpha = 0.2, width = 0.2) +
scale_y_continuous(name = "Mean Weight")
Comments: there is one conspicuously fat rat in Group 2, but it’s not all that extreme an outlier.
Try a filtered plot anyway:
Group_2_filtered <- RATS_box %>%
filter(mean < 550, Group == 2)
ggplot(Group_2_filtered, aes(y = mean, x = Group)) +
geom_boxplot(width = 0.1) +
scale_y_continuous(name = "Mean Weight")
“Perform a two-sample t-test”
# Since the rats fall into 3 groups, let's try testing only the first two
rats_1_2 <- subset(RATS_box, Group %in% c(1, 2))
t.test(mean ~ Group, data = rats_1_2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -9.0193, df = 10, p-value = 4.059e-06
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -275.5817 -166.3956
## sample estimates:
## mean in group 1 mean in group 2
## 263.7159 484.7045
This reiterates what’s visible from the box plot: Groups 1 and 2 are very different.
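As a side note (my own addition, not asked for in the exercise): with only a handful of rats per group the equal-variance assumption is shaky, and Welch’s t-test, which is actually R’s default, drops it:

```r
# Welch's two-sample t-test: simply omit var.equal = TRUE (FALSE is the
# default), so the two groups are not assumed to share a common variance.
t.test(mean ~ Group, data = rats_1_2)
```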
Check Group 2 & 3, which are closer to each other:
rats_2_3 <- subset(RATS_box, Group %in% c(2, 3))
t.test(mean ~ Group, data = rats_2_3, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -1.1086, df = 6, p-value = 0.3101
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -131.79027 49.60845
## sample estimates:
## mean in group 2 mean in group 3
## 484.7045 525.7955
Due to the low number of rats in each group, we can’t be sure Groups 2 and 3 are materially different (p = 0.31).
Add a baseline column for the starting weight
RATS_wide <- read_csv("data/RATS_wide.csv", show_col_types = FALSE)
RATS_based <- RATS_box %>%
mutate(baseline = RATS_wide$WD1)
“Fit a model”
fit <- lm(mean ~ baseline + Group, data = RATS_based)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATS_based)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.732 -3.812 1.991 6.889 13.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.14886 19.88779 1.516 0.1554
## baseline 0.93194 0.07793 11.959 5.02e-08 ***
## Group2 31.68866 17.11189 1.852 0.0888 .
## Group3 21.52296 21.13931 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9933
## F-statistic: 747.8 on 3 and 12 DF, p-value: 6.636e-14
“Compute the ANalysis Of VAriance (ANOVA) table for the fitted model”
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the results are a bit baffling at first: baseline is highly significant, meaning that a rat’s starting weight explains most of its later weight.
The diet (Group), meanwhile, is just above the 0.05 threshold for significance (p ≈ 0.076), so in an over-generalised summary: diet doesn’t matter.
For future research, perhaps the relative gains in weight could be modelled, which, I guess, was pretty much what the original Part II in the exercise set with random intercepts method was about.
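A minimal sketch of that idea (my own, reusing the `RATS_based` data frame from above): model the relative gain rather than the absolute mean weight.

```r
# Relative gain: mean weight over the study as a fraction of the baseline
# (day 1) weight, modelled against diet group.
RATS_gain <- RATS_based %>%
  mutate(rel_gain = mean / baseline)
summary(lm(rel_gain ~ Group, data = RATS_gain))
```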
Even though the empirical evidence, analysis, and plots pretty much tell the story already, I will nonetheless repeat it here in words so as to satisfy the requirement of having interpretations.
“Implement the analyses of Chapter 9 of MABS … but using the BPRS data”
“Factor variables”
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
“Glimpse the data”
glimpse(BPRS)
## Rows: 360
## Columns: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ bprs <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
“Check the dimensions of the data”
dim(BPRS)
## [1] 360 4
“Plot the data”
ggplot(BPRS, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(col = subject), alpha = 0.5) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Week") +
scale_y_continuous(name = "Score") +
theme(legend.position = "none")
“Create a regression model”
BPRS_regression_model <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_regression_model)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The intercept is like the starting point; week has an estimate of -2.27, meaning the score decreases by about 2.3 points per week.
Being in a different treatment group doesn’t seem to do much (p = 0.661).
“Create a random intercept model”
library(lme4)
## Loading required package: Matrix
BPRS_RI_model <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_RI_model)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
“Create a random intercept and random slope model”
BPRS_RI_RS_model <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_RI_RS_model)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
“Perform an ANOVA test on the two models”
anova(BPRS_RI_RS_model, BPRS_RI_model)
## Data: BPRS
## Models:
## BPRS_RI_model: bprs ~ week + treatment + (1 | subject)
## BPRS_RI_RS_model: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_RI_model 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_RI_RS_model 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comments: Including a random slope within subjects significantly improves the model fit compared to a model with only a random intercept.
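To see what those subject-specific effects actually look like (an illustrative peek, my own addition), `lme4::ranef` returns each subject’s deviations from the fixed effects:

```r
# Each row is one subject; the columns are its offsets to the population
# intercept and to the common week slope estimated by the mixed model.
head(ranef(BPRS_RI_RS_model)$subject)
```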
“Create a random intercept and random slope model with the interaction”
BPRS_interaction <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_interaction)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0513 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0050 8.0626
## week 0.9687 0.9842 -0.51
## Residual 96.4702 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
Compare new best to old best:
anova(BPRS_interaction, BPRS_RI_RS_model)
## Data: BPRS
## Models:
## BPRS_RI_RS_model: bprs ~ week + treatment + (week | subject)
## BPRS_interaction: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_RI_RS_model 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_interaction 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nah, not better enough.
“Create a vector of the fitted values”
Fitted <- fitted(BPRS_interaction)
“Create a new column fitted”
BPRS_fitted <- mutate(BPRS, Fitted = Fitted)
“Draw the plot with the Fitted scores”
library(ggplot2)
ggplot(BPRS_fitted, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(col = subject)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Week") +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "none")
So many models; it’s a spaghetti plot.